Abstract

We consider the online distributed non-stochastic experts problem, where the distributed 
system consists of one coordinator node that is connected to k sites, and the sites are required 
to communicate with each other via the coordinator. At each time-step t, one of the k site 
nodes has to pick an expert from the set {1, ...,n}, and the same site receives information 
about payoffs of all experts for that round. The goal of the distributed system is to minimize 
regret at time horizon T, while simultaneously keeping communication to a minimum. The 
two extreme solutions to this problem are: (i) Full communication: This essentially simulates
the non-distributed setting to obtain the optimal $O(\sqrt{\log(n) T})$ regret bound at the cost of
$T$ communication; (ii) No communication: Each site runs an independent copy, so the regret
is $O(\sqrt{\log(n) k T})$ and the communication is 0. This paper shows the difficulty of simultaneously
achieving regret asymptotically better than $\sqrt{kT}$ and communication better than $T$. We
give a novel algorithm that for an oblivious adversary achieves a non-trivial trade-off: regret
$O(\sqrt{k^{5(1+\epsilon)/6} T})$ and communication $O(T/k^{\epsilon})$, for any value of $\epsilon \in (0, 1/5)$. We also consider
a variant of the model, where the coordinator picks the expert. In this model, we show that
the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us a strategy that is near
optimal in the regret vs. communication trade-off.



1 Introduction 

In this paper, we consider the well-studied non-stochastic expert problem in a distributed setting. 
In the standard (non-distributed) setting, there are a total of n experts available for the decision- 
maker to consult, and at each round t = 1, . . . ,T, she must choose to follow the advice of one of 
the experts, say $a^t$, from the set $[n] = \{1, \ldots, n\}$. At the end of the round, she observes a payoff
vector $p^t \in [0, 1]^n$, where $p^t[a]$ denotes the payoff that would have been received by following the
advice of expert $a$. The payoff received by the decision-maker is $p^t[a^t]$. In the non-stochastic
setting, an adversary decides the payoff vectors at any time step. At the end of the $T$ rounds,
the regret of the decision-maker is the difference between the payoff that she would have received using
the single best expert at all times in hindsight, and the payoff that she actually received, i.e.
$R = \max_{a \in [n]} \sum_{t=1}^{T} p^t[a] - \sum_{t=1}^{T} p^t[a^t]$. The goal here is to minimize her regret; this general
problem in the non-stochastic setting captures several applications of interest, such as experiment 
design, online ad-selection, portfolio optimization, etc. (See [1, 2, 3, 4, 5] and references therein.)

Tight bounds on regret for the non-stochastic expert problem are obtained by the so-called
follow the regularized leader approaches; at time $t$, the decision-maker chooses a distribution, $x^t$,
over the $n$ experts. Here $x^t$ minimizes the quantity $\sum_{s=1}^{t-1} p^s \cdot x + r(x)$, where $r$ is a regularizer.
Common regularizers are the entropy function, which results in Hedge [1] or the exponentially
weighted forecaster (see chap. 2 in [2]), or, as we consider in this paper, $r(x) = \vec{\eta} \cdot x$, where
$\vec{\eta} \in_R [0, \eta]^n$ is a random vector, which gives the follow the perturbed leader (FPL) algorithm [6].
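
As an illustration of the perturbed-leader rule, the following is a minimal Python sketch, not the paper's implementation. It follows the expert with the largest perturbed cumulative payoff and reports the regret $R$ defined above. The function name `fpl_regret`, the choice to draw a fresh perturbation at every round, and the tie-breaking rule are assumptions made for this example.

```python
import random


def fpl_regret(payoffs, eta, seed=0):
    """Run follow the perturbed leader (FPL) on a list of payoff vectors.

    payoffs: list of length-n sequences with entries in [0, 1].
    eta: noise scale; the perturbation is drawn uniformly from [0, eta]^n.
    Returns (chosen_experts, regret) with 0-based expert indices.
    """
    rng = random.Random(seed)
    n = len(payoffs[0])
    cumulative = [0.0] * n  # cumulative payoff of each expert so far
    chosen, received = [], 0.0
    for p in payoffs:
        # Assumption: a fresh perturbation is drawn at every round.
        perturbed = [cumulative[a] + rng.uniform(0.0, eta) for a in range(n)]
        # Ties are broken towards the smallest index.
        a_t = max(range(n), key=lambda a: perturbed[a])
        chosen.append(a_t)
        received += p[a_t]
        for a in range(n):
            cumulative[a] += p[a]
    best_in_hindsight = max(cumulative)
    return chosen, best_in_hindsight - received
```

With `eta = 0` the sketch reduces to follow the leader, which makes its behaviour on a fixed sequence easy to trace by hand.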

We consider the setting when the decision maker is a distributed system, where several different 
nodes may select experts and/or observe payoffs at different time-steps. Such settings are common, 
e.g. internet search companies, such as Google or Bing, may use several nodes to answer search 
queries and the performance is revealed by user clicks. From the point of view of making better 
predictions, it is useful to pool all available data. However, this may involve significant communica- 
tion which may be quite costly. Thus, there is an obvious trade-off between cost of communication 
and cost of inaccuracy (because of not pooling together all data), which leads to the question: 

What is the explicit trade-off between the total amount of communication needed and the regret of 
the expert problem under worst case input? 



2 Models and Summary of Results 

We consider a distributed computation model consisting of one central coordinator node connected 
to k site nodes. The site nodes must communicate with each other using the coordinator node. 
At each time step, the distributed system receives a query¹, which indicates that it must choose
an expert to follow. At the end of the round, the distributed system observes the payoff vector. 
We consider two different models described in detail below: the site prediction model where one 
of the k sites receives a query at any given time-step, and the coordinator prediction model where 
the query is always received at the coordinator node. In both these models, the payoff vector, $p^t$,
is always observed at one of the k site nodes. Thus, some communication is required to share the 
information about the payoff vectors among nodes. As we shall see, these two models yield different 
algorithms and performance bounds. 



Goal: The algorithm implemented on the distributed system may use randomness, both to decide 
which expert to pick and to decide when to communicate with other nodes. We focus on simulta- 
neously minimizing the expected regret and the expected communication used by the (distributed) 
algorithm. Recall that the expected regret is:



$$E[R] = E\left[ \max_{a \in [n]} \sum_{t=1}^{T} p^t[a] - \sum_{t=1}^{T} p^t[a^t] \right] \qquad (1)$$



where the expectation is over the random choices made by the algorithm. The expected commu- 
nication is simply the expected number (over the random choices) of messages sent in the system. 



1 We do not use the word query in the sense of explicitly giving some information or context, but merely as 
indication of occurrence of an event that forces some site or coordinator to choose an expert. In particular, if any 
context is provided in the query the algorithms considered in this paper ignore all context - thus we are in the 
non-contextual expert setting. 
As we show in this paper, this is a challenging problem and to keep the analysis simple we focus
on bounds in terms of the number of sites k and the time horizon T, which are often the most im- 
portant scaling parameters. In particular, our algorithms are variants of follow the perturbed leader 
(FPL) and hence our bounds are not optimal in terms of the number of experts n. We believe that 
the dependence on the number of experts in our algorithms (upper bounds) can be strengthened 
using a different regularizer. Also, all our lower bounds are shown in terms of T and k, for n = 2. 
For larger n, using techniques similar to Theorem 3.6 in [2] should give the appropriate dependence
on n. 

Adversaries: In the non-stochastic setting, we assume that an adversary may decide the payoff
vectors, $p^t$, at each time-step and also the site, $s^t$, that receives the payoff vector (and also the
query in the site-prediction model). An oblivious adversary cannot see any of the actions of the 
distributed system, i.e. selection of expert, communication patterns or any random bits used. How- 
ever, the oblivious adversary may know the description of the algorithm. In addition to knowing 
the description of the algorithm, an adaptive adversary is stronger and can record all of the past 
actions of the algorithm, and use these arbitrarily to decide the future payoff vectors and site allo- 
cations. 

Communication: We do not explicitly account for message sizes. However, since we are interested 
in scaling with T and k, we do require that message size should not depend on the number of sites k 
or the number of time-steps T, but only on the number of experts n. In other words, we assume that 
n is substantially smaller than T and k. All the messages used in our algorithms contain at most n 
real numbers. As is standard in the distributed systems literature, we assume that communication 
delay is 0, i.e. the updates sent by any node are received by the recipients before any future query 
arrives. All our results still hold under the weaker assumption that the number of queries received 
by the distributed system in the duration required to complete a broadcast is negligible compared 
to k.²

We now describe the two models in greater detail, state our main results and discuss related 
work: 

1. Site Prediction Model: At each time step $t = 1, \ldots, T$, one of the $k$ sites, say $s^t$, receives a
query and has to pick an expert, $a^t$, from the set $[n] = \{1, \ldots, n\}$. The payoff vector $p^t \in [0, 1]^n$,
where $p^t[i]$ is the payoff of the $i$-th expert, is revealed only to the site $s^t$ and the decision-maker
(distributed system) receives payoff $p^t[a^t]$, corresponding to the expert actually chosen. The site
prediction model is commonly studied in distributed machine learning settings (see El 12] ) • The 
payoff vectors, $p^1, \ldots, p^T$, and also the choice of sites that receive the query, $s^1, \ldots, s^T$, are decided
by an adversary. There are two very simple algorithms in this model:

(i) Full communication: The coordinator always maintains the current cumulative payoff vector,
$\sum_{\tau=1}^{t-1} p^\tau$. At time step $t$, $s^t$ receives the current cumulative payoff vector $\sum_{\tau=1}^{t-1} p^\tau$ from the
coordinator, chooses an expert $a^t \in [n]$ using FPL, receives payoff vector $p^t$ and sends $p^t$ to the
coordinator, which updates its cumulative payoff vector. Note that the total communication is $2T$
and the system simulates (non-distributed) FPL to achieve the (optimal) regret guarantee $O(\sqrt{nT})$.

(ii) No communication: Each site maintains cumulative payoff vectors corresponding to the queries
received by it, thus implementing $k$ independent versions of FPL. Suppose that the $i$-th site
receives a total of $T_i$ queries ($\sum_{i=1}^{k} T_i = T$); the regret is then bounded by
$\sum_{i=1}^{k} O(\sqrt{n T_i}) = O(\sqrt{nkT})$, where the last step uses $\sum_{i=1}^{k} \sqrt{T_i} \le \sqrt{kT}$ (Cauchy-Schwarz),
and the total communication is 0. This upper bound is actually tight, as shown in Lemma 3
(Appendix C.2.1), in the event that there is no communication.

2 This is because in regularized leader like approaches, if the cumulative payoff vector changes by a small amount
the distribution over experts does not change much because of the regularization effect.
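
The two baseline protocols (i) and (ii) can be sketched in Python as follows. This is a minimal simulation under our own assumptions, not code from the paper. The names `run_baselines` and `_fpl_choice` are hypothetical, a fresh perturbation is drawn at each query, and the full-communication protocol is charged exactly two messages per time step, as in the description of (i).

```python
import random


def _fpl_choice(cumulative, eta, rng):
    """Pick the expert with the largest perturbed cumulative payoff."""
    perturbed = [c + rng.uniform(0.0, eta) for c in cumulative]
    return max(range(len(perturbed)), key=lambda a: perturbed[a])


def run_baselines(payoffs, sites, k, eta, seed=0):
    """Simulate both baselines on the same input.

    payoffs[t] is the payoff vector at step t and sites[t] in range(k) is
    the site that receives the query. Returns a dict mapping each baseline
    name to (total_payoff, messages).
    """
    rng = random.Random(seed)
    n = len(payoffs[0])
    global_sum = [0.0] * n                      # kept by the coordinator
    local_sums = [[0.0] * n for _ in range(k)]  # one vector per site
    payoff_full = payoff_none = 0.0
    messages_full = 0
    for p, s in zip(payoffs, sites):
        # (i) Full communication: coordinator -> site, then site -> coordinator.
        payoff_full += p[_fpl_choice(global_sum, eta, rng)]
        messages_full += 2
        # (ii) No communication: the site only sees its own history.
        payoff_none += p[_fpl_choice(local_sums[s], eta, rng)]
        for a in range(n):
            global_sum[a] += p[a]
            local_sums[s][a] += p[a]
    return {"full": (payoff_full, messages_full), "none": (payoff_none, 0)}
```

Running both baselines on the same payoff sequence and site assignment makes the two extremes of the trade-off directly comparable: $2T$ messages against none.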



Simultaneously achieving regret that is asymptotically lower than $\sqrt{knT}$ using communication
asymptotically lower than T turns out to be a significantly challenging question. Our main positive 
result is the first distributed expert algorithm in the oblivious adversarial (non-stochastic) setting, 
using sub-linear communication. Finding such an algorithm in the case of an adaptive adversary 
is an interesting open problem. 

Theorem 1. When $T \geq 2k^{2.3}$, there exists an algorithm for the distributed experts problem
that against an oblivious adversary achieves regret $O(\log(n) \sqrt{k^{5(1+\epsilon)/6} T})$ and uses communication
$O(T/k^{\epsilon})$, giving non-trivial guarantees in the range $\epsilon \in (0, 1/5)$.

2. Coordinator Prediction Model: At every time step, the query is received by the coordinator
node, which chooses an expert $a^t \in [n]$. However, at the end of the round, one of the site nodes,
say $s^t$, observes the payoff vector $p^t$. The payoff vectors $p^t$ and choice of sites $s^t$ are decided by
an adversary. This model is also a natural one and is explored in the distributed systems and
streaming literature (see [10, 11, 12] and references therein).

The full communication protocol is equally applicable here, getting the optimal regret bound,
$O(\sqrt{nT})$, at the cost of substantial (essentially $T$) communication. But here, we do not have
any straightforward algorithms that achieve non-trivial regret without using any communication.
This model is closely related to the label-efficient prediction problem (see Chapter 6.1-3 in [2]),
where the decision-maker has a limited budget and has to spend part of its budget to observe any
payoff information. The optimal strategy is to request payoff information randomly with probability
$C/T$ at each time-step, if $C$ is the communication budget. We refer to this algorithm as LEF
(label-efficient forecaster) [13].

Theorem 2. [13] (Informal) The LEF algorithm using FPL with communication budget $C$ achieves
regret $O(T\sqrt{n/C})$ against both an adaptive and an oblivious adversary.

One of the crucial differences between this model and that of the label-efficient setting is that 
when communication does occur, the site can send cumulative payoff vectors comprising all previ- 
ous updates to the coordinator rather than just the latest one. The other difference is that, unlike 
in the label-efficient case, the sites have the knowledge of their local regrets and can use it to decide 
when to communicate. However, our lower bounds for natural types of algorithms show that these 
advantages probably do not help to get better guarantees. 



Lower Bound Results: In the case of an adaptive adversary, we have an unconditional (for any 
type of algorithm) lower bound in both the models: 

Theorem 3. Let $n = 2$ be the number of experts. Then any (distributed) algorithm that achieves
expected regret $o(\sqrt{kT})$ must use communication $(T/k)(1 - o(1))$.

The proof appears in Appendix A. Notice that in the coordinator prediction model, when
C = T/k, this lower bound is matched by the upper bound of LEF. 
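
To spell out the matching claim, here is a one-line check of our own, reading the informal bound of Theorem 2 as $O(T\sqrt{n/C})$ and suppressing constants. Substituting $C = T/k$ gives

```latex
\[
  T\sqrt{\frac{n}{C}}\,\Big|_{C = T/k}
  \;=\; T\sqrt{\frac{nk}{T}}
  \;=\; \sqrt{nkT}.
\]
```

So for constant $n$, LEF with budget $T/k$ attains regret $O(\sqrt{kT})$, while by Theorem 3 any algorithm with regret $o(\sqrt{kT})$ needs essentially at least that much communication.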

In the case of an oblivious adversary, our results are weaker, but we can show that certain 
natural types of algorithms are not applicable directly in this setting. The so-called regularized
leader algorithms maintain a cumulative payoff vector, $P^t$, and use only this and a regularizer to
select an expert at time $t$. We consider two variants in the distributed setting:

(i) Distributed Counter Algorithms: Here the forecaster only uses $\tilde{P}^t$, which is an (approximate)
version of the cumulative payoff vector $P^t$. But we make no assumptions on how the forecaster will
use $\tilde{P}^t$. $\tilde{P}^t$ can be maintained while using sub-linear communication by applying techniques from
distributed systems literature [11].

(ii) Delayed Regularized Leader: Here the regularized leaders don't try to explicitly maintain an 
approximate version of the cumulative payoff vector. Instead, they may use an arbitrary commu- 
nication protocol, but make prediction using the cumulative payoff vector (using any past payoff 
vectors that they could have received) and some regularizer. 

We show in Section 3.2 that the distributed counter approach does not yield any non-trivial
guarantee in the site-prediction model even against an oblivious adversary. It is possible to show a
similar lower bound in the coordinator prediction model, but it is omitted since it follows easily
from the idea in the site-prediction model combined with an explicit communication lower bound 
given in 

Section 4 shows that the delayed regularized leader approach does not yield non-trivial
guarantees even against an oblivious adversary in the coordinator prediction model, suggesting that the
LEF algorithm is near optimal.

Related Work: Recently there has been significant interest in distributed online learning questions
(see for example [3, 8, 9]). However, these works have focused mainly on stochastic optimization
problems. Thus, the techniques used, such as reducing variance through mini-batching, are not
applicable to our setting. Questions such as network structure [8] and network delays [9] are
interesting in our setting as well; however, at present our work focuses on establishing some non-trivial
regret guarantees in the distributed online non-stochastic experts setting. Study of communication
as a resource in distributed learning is also considered in [14, 15, 16]; however, this body of work
seems only applicable to offline learning.

The other related work is that of distributed functional monitoring [10] and in particular
distributed counting [11, 12], and sketching [17]. Some of these techniques have been successfully
applied in offline machine learning problems [18]. However, we are the first to analyze the
performance-communication trade-off of an online learning algorithm in the standard distributed
functional monitoring framework [10]. An application of a distributed counter to an online Bayesian
regression was proposed in Liu et al. [10]. Our lower bounds, discussed below, show that approximate
distributed counter techniques do not directly yield non-trivial algorithms.

3 Site-prediction model 

3.1 Upper Bounds 

We describe our algorithm that simultaneously achieves non-trivial bounds on expected regret and 
expected communication. We begin by making two assumptions that simplify the exposition. First, 
we assume that there are only 2 experts. The generalization from 2 experts to n is easy, as discussed
in Remark 1 at the end of this section. Second, we assume that there exists a global query
counter, that is available to all sites and the coordinator, which keeps track of the total number
of queries received across the k sites. We discuss this assumption in Remark 2 at the end of the
section. As is often the case in online algorithms, we assume that the time horizon T is known.
Otherwise, the standard doubling trick may be employed. The notation used in this section is
defined in Table 1.

Algorithm Description: Our algorithm DFPL is described in Figure 1(a). We make use of the FPL
algorithm, described in Figure 1(b), which takes as a parameter the amount of added noise $\eta$.
Symbol            Definition

$p^t$             Payoff vector at time-step $t$, $p^t \in [0, 1]^2$
$\ell$            The length of the blocks into which inputs are divided
$b$               Number of input blocks, $b = T/\ell$
$P^i$             Cumulative payoff vector within block $i$, $P^i = \sum_{t=(i-1)\ell+1}^{i\ell} p^t$
$Q^i$             Cumulative payoff vector until end of block $(i-1)$, $Q^i = \sum_{j=1}^{i-1} P^j$
$M(v)$            For vector $v \in \mathbb{R}^2$, $M(v) = 1$ if $v_1 > v_2$; $M(v) = 2$ otherwise
$FP^i(\eta)$      Random variable denoting the payoff obtained by playing FPL($\eta$) on block $i$
$FR^i_a(\eta)$    Random variable denoting the regret with respect to action $a$ of playing FPL($\eta$) on block $i$;
                  $FR^i_a(\eta) = P^i[a] - FP^i(\eta)$
$FR^i(\eta)$      Random variable denoting the regret of playing FPL($\eta$) on payoff vectors in block $i$;
                  $FR^i(\eta) = \max_{a=1,2} P^i[a] - FP^i(\eta) = \max_{a=1,2} FR^i_a(\eta)$

Table 1: Notation used in Algorithm DFPL (Fig. 1) and in Section 3.1



DFPL($T$, $\ell$, $\eta$)

  set $b = T/\ell$; $\eta' = \sqrt{\ell}$; $q = 2\ell^3 T^2/\eta^5$
  for $i = 1, \ldots, b$
    let $Y_i$ = Bernoulli($q$)
    if $Y_i = 1$ then  #step phase
      play FPL($\eta'$) for time-steps $(i-1)\ell + 1, \ldots, i\ell$
    else  #block phase
      $a^i = M(Q^i + r)$ where $r \in_R [0, \eta]^2$
      play $a^i$ for time-steps $(i-1)\ell + 1, \ldots, i\ell$
    $P^i = \sum_{t=(i-1)\ell+1}^{i\ell} p^t$
    $Q^{i+1} = Q^i + P^i$

(a)

FPL($T$, $n = 2$, $\eta$)

  for $t = 1, \ldots, T$
    $a^t = M(\sum_{\tau=1}^{t-1} p^\tau + r)$ where $r \in_R [0, \eta]^2$
    follow expert $a^t$ at time-step $t$
    observe payoff vector $p^t$

(b)



Figure 1: (a) DFPL: Distributed Follow the Perturbed Leader, (b) FPL: Follow the Perturbed Leader with
parameter $\eta$ for 2 experts ($M(\cdot)$ is defined in Table 1; $r$ is a random vector)
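
The pseudocode of Figure 1 can be turned into a small simulation. The Python sketch below is our own illustrative rendering for two experts, not the authors' code. The names `dfpl` and `leader` are hypothetical, $q$ is clamped to 1 so that it is a valid probability, the block length is assumed to divide $T$, and the per-block message costs ($2\ell$ in a step phase, $2k$ in a block phase) are assumed constants standing in for the $O(\ell)$ and $O(k)$ costs.

```python
import math
import random


def leader(v):
    """M(v) from Table 1: expert 1 if v[0] > v[1], otherwise expert 2."""
    return 1 if v[0] > v[1] else 2


def dfpl(payoffs, ell, eta, k, seed=0):
    """Simulate DFPL for two experts; payoffs[t] is a pair in [0, 1]^2.

    Returns (total_payoff, step_blocks, messages). The message costs are
    assumptions of this sketch, see the accompanying text.
    """
    rng = random.Random(seed)
    T = len(payoffs)
    b = T // ell                         # number of blocks (assumes ell divides T)
    eta_step = math.sqrt(ell)            # eta' in Figure 1(a)
    q = min(1.0, 2 * ell**3 * T**2 / eta**5)  # clamped to a valid probability
    Q = [0.0, 0.0]                       # cumulative payoff before the current block
    total, step_blocks, messages = 0.0, 0, 0
    for i in range(b):
        block = payoffs[i * ell:(i + 1) * ell]
        if rng.random() < q:             # step phase: fresh FPL inside the block
            step_blocks += 1
            messages += 2 * ell
            local = [0.0, 0.0]
            for p in block:
                r = [rng.uniform(0.0, eta_step) for _ in range(2)]
                a = leader([local[0] + r[0], local[1] + r[1]])
                total += p[a - 1]
                local = [local[0] + p[0], local[1] + p[1]]
        else:                            # block phase: one expert for the whole block
            messages += 2 * k
            r = [rng.uniform(0.0, eta) for _ in range(2)]
            a = leader([Q[0] + r[0], Q[1] + r[1]])
            total += sum(p[a - 1] for p in block)
        # P^i is the block sum; Q^{i+1} = Q^i + P^i in both phases.
        Q = [Q[0] + sum(p[0] for p in block), Q[1] + sum(p[1] for p in block)]
    return total, step_blocks, messages
```

Very small values of `eta` push $q$ to 1, so every block is a step phase. Very large values make step phases essentially never occur, which isolates the block-level FPL.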



The DFPL algorithm treats the $T$ time steps as $b (= T/\ell)$ blocks, each of length $\ell$. At a high level, with
probability $q$ on any given block the algorithm is in the step phase, running a copy of FPL (with
noise parameter $\eta'$) across all time steps of the block, synchronizing after each time step. Otherwise
it is in a block phase, running a copy of FPL (with noise parameter $\eta$) across blocks with the same
expert being followed for the entire block and synchronizing after each block. This effectively makes
$P^i$, the cumulative payoff over block $i$, the payoff vector for the block FPL. The block FPL has on
average $(1 - q)T/\ell$ total time steps. We begin by stating a (slightly stronger) guarantee for FPL.

Lemma 1. Consider the case $n = 2$. Let $p^1, \ldots, p^T \in [0, 1]^2$ be a sequence of payoff vectors such
that $\max_t |p^t|_\infty \le B$ and let the number of experts be 2. Then FPL($\eta$) has the following guarantee
on expected regret: $E[R] \le \frac{B}{\eta} \sum_{t=1}^{T} |p^t[1] - p^t[2]| + \eta$.

The proof is a simple modification of the standard analysis [6] and is given in
Appendix B for completeness. The rest of this section is devoted to the proof of Lemma 2.

Lemma 2. Consider the case $n = 2$. If $T \geq 2k^{2.3}$, Algorithm DFPL (Fig. 1), when run with
parameters $\ell$, $T$, $\eta = \ell^{5/12} T^{1/2}$ and $b, \eta', q$ as defined in Fig. 1, has expected regret $O(\sqrt{\ell^{5/6} T})$
and expected communication $O(Tk/\ell)$. In particular for $\ell = k^{1+\epsilon}$ for $0 < \epsilon < 1/5$, the algorithm
simultaneously achieves regret that is asymptotically lower than $\sqrt{kT}$ and communication that is
asymptotically lower³ than $T$.
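
To get a feel for the parameter regime, the following Python helper evaluates the quantities in Lemma 2 for concrete $T$, $k$ and $\epsilon$. It is a sketch for exploration only. The function name `dfpl_parameters` and the dictionary keys are our own, and all constants hidden in the $O(\cdot)$ notation are ignored.

```python
import math


def dfpl_parameters(T, k, eps):
    """Return the parameter choices of Lemma 2 for block length ell = k^(1+eps)."""
    ell = k ** (1.0 + eps)
    eta = ell ** (5.0 / 12.0) * math.sqrt(T)   # regret scale, sqrt(ell^(5/6) T)
    q = 2.0 * ell**3 * T**2 / eta**5           # step-phase probability
    return {
        "ell": ell,
        "eta": eta,
        "q": q,
        "regret_scale": eta,
        "no_comm_regret_scale": math.sqrt(k * T),
        "block_comm_scale": T * k / ell,       # equals T / k^eps
    }
```

For example, `dfpl_parameters(T=10**8, k=100, eps=0.1)` gives a regret scale below $\sqrt{kT}$ and a block-phase communication scale below $T$, as the lemma asserts for $0 < \epsilon < 1/5$.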

Since we are in the case of an oblivious adversary, we may assume that the payoff vectors
$p^1, \ldots, p^T$ are fixed ahead of time. Without loss of generality let expert 1 (out of {1, 2}) be the
one that has greater payoff in hindsight. Recall that $FR^i_1(\eta')$ denotes the random variable that
is the regret of playing FPL($\eta'$) in a step phase on block $i$ with respect to the first expert. In
particular, this will be negative if expert 2 is the best expert on block $i$, even though globally
expert 1 is better. In fact, this is exactly what our algorithm exploits: it gains on regret in the
communication-expensive step phase while saving on communication in the block phase.

The regret can be written as

$$R = \sum_{i=1}^{b} \left( Y_i \cdot FR^i_1(\eta') + (1 - Y_i)\left(P^i[1] - P^i[a^i]\right) \right).$$

Note that the random variables $Y_i$ are independent of the random variables $FR^i_1(\eta')$ and the random
variables $a^i$. As $E[Y_i] = q$, we can bound the expression for expected regret as follows:

$$E[R] \le q \sum_{i=1}^{b} E[FR^i_1(\eta')] + (1 - q) \sum_{i=1}^{b} E\left[P^i[1] - P^i[a^i]\right] \qquad (2)$$

We first analyze the second term of the above equation. This is just the regret corresponding
to running FPL($\eta$) at the block level, with $T/\ell$ time steps. Using the fact that
$\max_i |P^i|_\infty \le \ell \max_t |p^t|_\infty \le \ell$, Lemma 1 allows us to conclude that:

$$\sum_{i=1}^{b} E\left[P^i[1] - P^i[a^i]\right] \le \frac{\ell}{\eta} \sum_{i=1}^{b} |P^i[1] - P^i[2]| + \eta \qquad (3)$$

Next, we analyse the first term of inequality (2). We chose $\eta' = \sqrt{\ell}$ (see Fig. 1)
and the analysis of FPL guarantees that $E[FR^i(\eta')] \le 2\sqrt{\ell}$, where $FR^i(\eta')$ denotes the random
variable that is the actual regret of FPL($\eta'$), not the regret with respect to expert 1 (which is
$FR^i_1(\eta')$). Now either $FR^i(\eta') = FR^i_1(\eta')$ (i.e. expert 1 was the better one on block $i$), in which case
$E[FR^i_1(\eta')] \le 2\sqrt{\ell}$; otherwise $FR^i(\eta') = FR^i_2(\eta')$ (i.e. expert 2 was the better one on block $i$), in
which case $E[FR^i_1(\eta')] \le 2\sqrt{\ell} + P^i[1] - P^i[2]$. Note that in this expression $P^i[1] - P^i[2]$ is negative.
Putting everything together we can write that $E[FR^i_1(\eta')] \le 2\sqrt{\ell} - (P^i[2] - P^i[1])_+$, where
$(x)_+ = x$ if $x \ge 0$ and 0 otherwise. Thus, we get the main equation for regret.

$$E[R] \le 2qb\sqrt{\ell} \underbrace{- q \sum_{i=1}^{b} (P^i[2] - P^i[1])_+}_{\text{term 1}} + \underbrace{\frac{\ell}{\eta} \sum_{i=1}^{b} |P^i[1] - P^i[2]|}_{\text{term 2}} + \eta \qquad (4)$$

Note that the first (i.e. $2qb\sqrt{\ell}$) and last (i.e. $\eta$) terms of inequality (4) are $O(\sqrt{\ell^{5/6} T})$ for the
setting of the parameters as in Lemma 2. The strategy is to show that when "term 2" becomes
large, then "term 1" is also large in magnitude, but negative, compensating the effect of "term 2".
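
For completeness, a short calculation confirms the claim about the first term. It is our own expansion, substituting the parameter choices $\eta = \ell^{5/12} T^{1/2}$, $q = 2\ell^3 T^2/\eta^5$ and $b = T/\ell$ from Lemma 2 and Figure 1:

```latex
\begin{align*}
  q &= \frac{2\ell^{3}T^{2}}{\eta^{5}}
     = \frac{2\ell^{3}T^{2}}{\ell^{25/12}\,T^{5/2}}
     = 2\,\ell^{11/12}\,T^{-1/2}, \\
  2qb\sqrt{\ell}
    &= 2 \cdot 2\,\ell^{11/12}\,T^{-1/2} \cdot \frac{T}{\ell} \cdot \ell^{1/2}
     = 4\,\ell^{5/12}\,T^{1/2}
     = 4\eta
     = O\!\left(\sqrt{\ell^{5/6}\,T}\right).
\end{align*}
```

The last term is $\eta$ itself, so both are of the stated order.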

3 Note that here asymptotics is in terms of both parameters $k$ and $T$. Getting communication of the form $T^{1-\delta} f(k)$
for regret bound better than $\sqrt{kT}$ seems to be a fairly difficult and interesting problem.

We consider a few cases:

Case 1: When the best expert is identified quickly and not changed thereafter. Let $\zeta$ denote the
maximum index, $i$, such that $Q^i[1] - Q^i[2] \le \eta$. Note that after the block $\zeta$ is processed, the
algorithm in the block phase will never follow expert 2.

Suppose that $\zeta \le (\eta/\ell)^2$. We note that the correct bound for "term 2" is now actually
$(\ell/\eta) \sum_{i=1}^{\zeta} |P^i[1] - P^i[2]| \le \ell^2 \zeta/\eta \le \eta$, since $|P^i[1] - P^i[2]| \le \ell$ for all $i$.

Case 2: The best expert may not be identified quickly; furthermore, $|P^i[1] - P^i[2]|$ is large often. In
this case, although "term 2" may be large (when $(P^i[1] - P^i[2])$ is large), this is compensated by
the negative regret in "term 1" in expression (4). This is because if $|P^i[1] - P^i[2]|$ is large often,
but the best expert is not identified quickly, there must be enough blocks on which $(P^i[2] - P^i[1])$
is positive and large.

Notice that $\zeta \ge (\eta/\ell)^2$. Define $\lambda = \eta^2/T$ and let $S = \{ i \le \zeta : |P^i[1] - P^i[2]| \ge \lambda \}$.
Let $\alpha = |S|/\zeta$. We show that $\sum_{i=1}^{\zeta} (P^i[2] - P^i[1])_+ \ge (\alpha\zeta\lambda)/2 - \eta$. To see this consider
$S_1 = \{ i \in S : P^i[1] > P^i[2] \}$ and $S_2 = S \setminus S_1$. First, observe that $\sum_{i \in S} |P^i[1] - P^i[2]| \ge \alpha\zeta\lambda$. Then,
if $\sum_{i \in S_2} (P^i[2] - P^i[1]) \ge (\alpha\zeta\lambda)/2$, we are done. If not, $\sum_{i \in S_1} (P^i[1] - P^i[2]) \ge (\alpha\zeta\lambda)/2$. Now notice
that $\sum_{i=1}^{\zeta} (P^i[1] - P^i[2]) \le \eta$, hence it must be the case that $\sum_{i=1}^{\zeta} (P^i[2] - P^i[1])_+ \ge (\alpha\zeta\lambda)/2 - \eta$.
Now for the value of $q = 2\ell^3 T^2/\eta^5$ and if $\alpha \ge \eta^2/(T\ell)$, the negative contribution of "term 1" is at
least $q\alpha\zeta\lambda/2$, which is at least the maximum possible positive contribution of "term 2", which is
$\ell^2 \zeta/\eta$. It is easy to see that these quantities are equal when $\alpha = \eta^2/(T\ell)$ and hence the total contribution of "term
1" and "term 2" together is at most $\eta$.
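
The balance between the two terms can be checked directly. The following is our own expansion at the threshold $\alpha = \eta^2/(T\ell)$, using $\lambda = \eta^2/T$ and $q = 2\ell^3 T^2/\eta^5$:

```latex
\[
  \frac{q\,\alpha\,\zeta\,\lambda}{2}
  \;=\; \frac{1}{2}\cdot\frac{2\ell^{3}T^{2}}{\eta^{5}}
        \cdot\frac{\eta^{2}}{T\ell}\cdot\zeta\cdot\frac{\eta^{2}}{T}
  \;=\; \frac{\ell^{2}\,\zeta}{\eta}.
\]
```

For larger $\alpha$ the left-hand side only grows, while the bound $\ell^2\zeta/\eta$ on "term 2" does not depend on $\alpha$.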

Case 3: When $|P^i[1] - P^i[2]|$ is "small" most of the time. In this case the parameter $\eta$ is actually
well-tuned (which was not the case when $|P^i[1] - P^i[2]| \approx \ell$) and gives us a small overall regret.
(See Lemma 1.) We have $\alpha < \eta^2/(T\ell)$. Note that $\alpha\ell \le \lambda = \eta^2/T$ and that $\zeta \le T/\ell$. In this case
"term 2" can be bounded easily as follows: $\frac{\ell}{\eta} \sum_{i=1}^{\zeta} |P^i[1] - P^i[2]| \le \frac{\ell}{\eta} (\alpha\zeta\ell + (1 - \alpha)\zeta\lambda) \le 2\eta$.
The above three cases exhaust all possibilities and hence no matter what the nature of the payoff
sequence, the expected regret of DFPL is bounded by $O(\eta)$ as required. The expected total communication
is easily seen to be $O(qT + Tk/\ell)$: the $q(T/\ell)$ blocks on which step FPL is used contribute
$O(\ell)$ communication each, and the $(1 - q)(T/\ell)$ blocks where block FPL is used contribute $O(k)$
communication each.

Remark 1. Our algorithm can be generalized to n experts by recursively dividing the set of experts
in two and applying our algorithm to two meta-experts, as shown in Section C.1 in the Appendix.
However, the bound obtained in Section C.1 is not optimal in terms of the number of experts, n.
This observation and Lemma 2 imply Theorem 1.

Remark 2. The assumption that there is a global counter is necessary because our algorithm divides
the input into blocks of size $\ell$. However, it is not an impediment because it is sufficient that the block
sizes are in the range $[0.99\ell, 1.01\ell]$. Assuming that the coordinator always signals the beginning and
end of the block (by a broadcast which only adds $2k$ messages to any block), we can use a distributed
counter that guarantees a very tight approximation to the number of queries received in each block
with at most $O(k \log(\ell))$ messages communicated (see [11]).

3.2 Lower Bounds 

In this section we give a lower bound on distributed counter algorithms in the site prediction model.
Distributed counters allow tight approximation guarantees, i.e. for factor $\beta$ additive approximation,
the communication required is only $O(T \log(T) \sqrt{k}/\beta)$ [11]. We observe that the noise used by FPL
is quite large, $O(\sqrt{T})$, and so it is tempting to find a suitable $\beta$ and run FPL using approximate
cumulative payoffs. We consider the class of algorithms such that:

(i) Whenever each site receives a query, it has an (approximate) cumulative payoff of each expert
to additive accuracy $\beta$. Furthermore, any communication is only used to maintain such a counter.

(ii) Any site only uses the (approximate) cumulative payoffs and any local information it may have 
to choose an expert when queried. 

However, our negative result shows that even with a highly accurate counter, $\beta = O(k)$, the
non-stochasticity of the payoff sequence may cause any such algorithm to have $\Omega(\sqrt{kT})$ regret. Furthermore,
we show that the communication required by any distributed algorithm that implements (approximate)
counters to additive error $k/10$ on all sites⁴ is at least $\Omega(T)$.

Theorem 4. At any time step $t$, suppose each site has an (approximate) cumulative payoff count,
$\tilde{P}^t[a]$, for every expert such that $|\tilde{P}^t[a] - P^t[a]| \le \beta$. Then we have the following:

1. If $\beta \le k$, any algorithm that uses the approximate counts $\tilde{P}^t[a]$ and any local information at the
site making the decision cannot achieve expected regret asymptotically better than $\sqrt{\beta T}$.
2. Any protocol on the distributed system that guarantees that at each time step, each site has a
$\beta = k/10$ approximate cumulative payoff with probability $\ge 1/2$, uses $\Omega(T)$ communication.



4 Coordinator-prediction model 

In the coordinator prediction model, as mentioned earlier, it is possible to use the label-efficient
forecaster, LEF (Chap. 6 in [2]; [13]). Let $C$ be an upper bound on the total amount of communication
we are allowed to use. The label-efficient predictor translates into the following simple protocol:
Whenever a site receives a payoff vector, it will forward that particular payoff to the coordinator
with probability $p \approx C/T$. The coordinator will always execute the exponentially weighted
forecaster over the sampled subset of payoffs to make new decisions. Here, the expected regret is
$O(T\sqrt{\log(n)/C})$. In other words, if our regret needs to be $O(\sqrt{T})$, the communication needs to be
linear in $T$.
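
The protocol above can be sketched in a few lines of Python. This is an illustrative sketch rather than the forecaster of [13] verbatim. The function name `label_efficient_forecaster`, the free parameter `learning_rate`, and the importance weighting of forwarded payoffs by the inverse forwarding probability are assumptions on our part, the latter being the usual device for keeping the payoff estimates unbiased.

```python
import math
import random


def label_efficient_forecaster(payoffs, budget, learning_rate, seed=0):
    """Coordinator-prediction protocol with random forwarding of payoffs.

    Each payoff vector is forwarded with probability budget / T. The
    coordinator samples an expert from exponential weights over
    importance-weighted payoff estimates. Returns (total_payoff, messages).
    """
    rng = random.Random(seed)
    T, n = len(payoffs), len(payoffs[0])
    prob = min(1.0, budget / T)
    estimates = [0.0] * n      # importance-weighted cumulative payoffs
    total, messages = 0.0, 0
    for p in payoffs:
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(estimates)
        weights = [math.exp(learning_rate * (e - top)) for e in estimates]
        a_t = rng.choices(range(n), weights=weights)[0]
        total += p[a_t]
        # The site that observed p forwards it with probability `prob`.
        if rng.random() < prob:
            messages += 1
            for a in range(n):
                estimates[a] += p[a] / prob
    return total, messages
```

The expected number of messages is `budget`, and the two extremes (`budget = T` and `budget = 0`) recover full communication of payoffs and no communication at all.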

We observe that in principle there is a possibility of better algorithms in this setting for mainly
two reasons: (i) when the sites send payoff vectors to the coordinator, they can send cumulative
payoffs rather than the latest ones, thus giving more information, and (ii) the sites may decide
when to communicate as a function of the payoff vectors instead of just randomly. However, we
present a lower bound that shows that for a natural family of algorithms achieving regret $O(\sqrt{T})$
requires at least $\Omega(T^{1-\epsilon})$ communication for every $\epsilon > 0$, even when $k = 1$. The type of algorithms we consider
may have an arbitrary communication protocol, but it satisfies the following: (i) Whenever a site
communicates with the coordinator, the site will report its local cumulative payoff vector, (ii) When
the coordinator makes a decision, it will execute FPL($\sqrt{T}$) (follow the perturbed leader with noise
$\sqrt{T}$) using the latest cumulative payoff vector. The proof of Theorem 5 appears in Appendix D,
and the results could be generalized to other regularizers.

Theorem 5. Consider the distributed non-stochastic expert problem in the coordinator prediction
model. Any algorithm of the kind described above that achieves regret $O(\sqrt{T})$ must use $\Omega(T^{1-\epsilon})$
communication against an oblivious adversary for every constant $\epsilon > 0$.
Figure 2: (a) Cumulative regret for the MC sequences as a function of correlation $\lambda$, (b)
cumulative regret vs. communication cost for the MC and zig-zag sequences.

5 Simulations

In this section, we describe some simulation results comparing the efficacy of our algorithm DFPL 
with some other techniques. We compare DFPL against simple algorithms - full communication 
and no communication, and two other algorithms which we refer to as mini-batch and HYZ. In the 
mini-batch algorithm, the coordinator requests randomly, with some probability p at any time step, 
all cumulative payoff vectors at all sites. It then broadcasts the sum (across all of the sites) back to 
the sites, so that all sites have the latest cumulative payoff vector. Whenever such a communication 
does occur, the cost is 2k. We refer to this as mini-batch because it is similar in spirit to the mini- 
batch algorithms used in the stochastic optimization problems. In the HYZ algorithm, we use the 
distributed counter technique of Huang et al. [11] to maintain the (approximate) cumulative payoff
for each expert. Whenever a counter update occurs, the coordinator must broadcast to all nodes 
to make sure they have the most current update. 
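
As a rough illustration of the communication accounting for the mini-batch baseline described above, the following Python sketch simulates only the synchronisation schedule: at every time step the coordinator triggers a synchronisation with probability p, and each synchronisation is charged 2k messages. The function name, the seed handling and the example parameters are hypothetical; this is not the simulation code used for the experiments.

```python
import random

def minibatch_communication(T, k, p, seed=0):
    """Total number of messages used by the mini-batch baseline over T steps.

    Each synchronisation costs 2*k messages: k uploads of cumulative payoff
    vectors plus k broadcasts of the summed vector back to the sites.
    """
    rng = random.Random(seed)
    syncs = sum(1 for _ in range(T) if rng.random() < p)
    return 2 * k * syncs

if __name__ == "__main__":
    T, k, p = 20000, 20, 0.01
    used = minibatch_communication(T, k, p)
    print(used, "messages used; the expected cost is", 2 * k * p * T)
```

In expectation the cost is 2kpT, so p directly controls where the baseline sits on the regret vs. communication curve.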

We consider two types of synthetic sequences. The first is a zig-zag sequence, with $\mu$ being the
length of one increase/decrease. For the first $\mu$ time steps the payoff vector is always (1,0) (expert
1 being better), then for the next $2\mu$ time steps, the payoff vector is (0,1) (expert 2 is better), and
then again for the next $2\mu$ time steps, the payoff vector is (1,0), and so on. The zig-zag sequence is
also the sequence used in the proof of the lower bound in Theorem 5. The second is a two-state
Markov chain (MC) with states 1, 2 and Pr[l 2] = Pr[2 -)■ 1] = ^. While in state 1, the payoff 
vector is (1,0) and when in state 2 it is (0, 1). 
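
The two input families can be generated with a few lines of Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the Markov chain is assumed to start in state 1, and its switching probability is left as a free parameter `switch_prob` rather than tied to the value used in the experiments.

```python
import random

def zigzag_sequence(T, mu):
    """Zig-zag payoffs: mu steps of (1, 0), then alternating runs of length
    2*mu, starting with (0, 1)."""
    seq = []
    favoured, run = 0, mu              # the very first run has half length
    while len(seq) < T:
        vec = (1, 0) if favoured == 0 else (0, 1)
        seq.extend([vec] * min(run, T - len(seq)))
        favoured, run = 1 - favoured, 2 * mu
    return seq

def markov_sequence(T, switch_prob, seed=0):
    """Two-state chain: payoff (1, 0) in state 1 and (0, 1) in state 2.
    The chain switches state with probability switch_prob after each step."""
    rng = random.Random(seed)
    state, seq = 1, []                 # assumption: start in state 1
    for _ in range(T):
        seq.append((1, 0) if state == 1 else (0, 1))
        if rng.random() < switch_prob:
            state = 3 - state
    return seq
```

For the zig-zag input the difference between the cumulative payoffs of the two experts oscillates between $+\mu$ and $-\mu$, which is what makes stale information costly for a site.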

In our simulations we use T = 20000 predictions and k = 20 sites. Fig. 2(a) shows the
performance of the above algorithms for the MC sequences; the results are averaged across 100
runs, over both the randomness of the MC and the algorithms. Fig. 2(b) shows the worst-case
cumulative communication vs. the worst-case cumulative regret trade-off for three algorithms:
DFPL, mini-batch and HYZ, over all the described sequences. While in general it is hard to compare
algorithms on non-stochastic inputs, our results confirm that for non-stochastic sequences inspired
by the lower bounds in the paper, our algorithm DFPL outperforms other related techniques.
A Adaptive Adversary

This section contains a proof of Theorem 3. The proof makes use of Khinchine's inequality (see
Appendix A.1.14 in [2]).

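Khinchine's Inequality. Let $\sigma_1, \ldots, \sigma_n$ be Rademacher random variables, i.e. $\Pr[\sigma_i = 1] =
\Pr[\sigma_i = -1] = 1/2$. Then for any real numbers $a_1, \ldots, a_n$,

$$E\left[\left|\sum_{i=1}^{n} a_i \sigma_i\right|\right] \ge \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{n} a_i^2}.$$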
Proof of Theorem 3. The adaptive adversary divides the total T time steps into T/k time blocks,
each consisting of k time steps. During each block of k time steps, each of the k sites receives
exactly 1 query. At time $t = 1, k+1, 2k+1, \ldots$, the adversary tosses an unbiased coin. Let $p_H$
denote the payoff vector corresponding to heads, where $p_H[1] = 1$ and $p_H[2] = 0$. Similarly, let $p_T$
(corresponding to tails) be such that $p_T[1] = 0$ and $p_T[2] = 1$. For $i = 1, \ldots, T/k$ and $j = 1, \ldots, k$,
the adaptive adversary does the following. At time $(i-1)k + j$, if there was no communication on
the part of the decision maker (distributed system) between time steps $(i-1)k+1, \ldots, (i-1)k+j-1$,
then the payoff vector is $p_H$ if the coin toss at time $(i-1)k+1$ was heads, and $p_T$ otherwise.
On the other hand, if there was any communication, then the adaptive adversary tosses a fresh random
coin and sets the payoff vector accordingly.

Consider the expected payoff of the algorithm. At time $t = (i-1)k + j$, if there was
communication between time steps $(i-1)k+1$ and $(i-1)k+j-1$, then the adversary has chosen the
payoff vector uniformly at random between $p_H$ and $p_T$, and hence the expected reward at time
step t is exactly 1/2. On the other hand, if there was no communication between these time steps,
then the site j making the decision has no information about the coin toss of the adversary at time
$(i-1)k+1$, and hence the expected reward is still 1/2. Thus, the total expected reward of the
algorithm (by linearity of expectation) is T/2.

Note that,

$$\max_{a \in \{1,2\}} \sum_{t=1}^{T} p^t[a] = \frac{1}{2} \sum_{t=1}^{T} \left(p^t[1] + p^t[2]\right) + \frac{1}{2} \left|\sum_{t=1}^{T} \left(p^t[1] - p^t[2]\right)\right| = \frac{T}{2} + \frac{1}{2} \left|\sum_{t=1}^{T} \left(p^t[1] - p^t[2]\right)\right| \qquad (5)$$



Let $I \subseteq [T/k]$ be the indices of the blocks for which there was some communication. Consider
blocks in $I$ and those outside of $I$. Suppose the block $(i-1)k+1, \ldots, ik$ is such that $i \notin I$; then
$\left|\sum_{t=(i-1)k+1}^{ik} (p^t[1] - p^t[2])\right| = k$. Note that all such block sums (as random variables) are independent
of all other block sums. For a block $(i-1)k+1, \ldots, ik$ such that $i \in I$, let $c(i)$ be the first index
such that communication occurs at time $(i-1)k + c(i)$. Then $\left|\sum_{t=(i-1)k+1}^{(i-1)k+c(i)} (p^t[1] - p^t[2])\right| = c(i)$;
also note that the $p^t$ for $t = (i-1)k + c(i) + 1, \ldots, ik$ are all based on independent coin tosses. Then
note that,

$$\sum_{t=1}^{T} \left(p^t[1] - p^t[2]\right) = \sum_{i \notin I} k\,\sigma_{i,1} + \sum_{i \in I} \Big( c(i)\,\sigma_{i,1} + \sum_{j=c(i)+1}^{k} \sigma_{i,j} \Big) \qquad (6)$$

where the $\sigma_{i,j}$ are the Rademacher variables corresponding to the coin tosses of the adversary at time
step $(i-1)k + j$. Also note that,
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E IX)P*[1]-P*[2] 
Then, Khinchine's inequality and ^ gives us that 
Now, unless $|I| = (T/k)(1 - o(1))$, it must be the case that $E[\max_{a \in \{1,2\}} \sum_{t=1}^{T} p^t[a]] \ge T/2 + \Omega(\sqrt{kT})$,
leading to total expected regret $\Omega(\sqrt{kT})$. Hence, any algorithm that achieves regret $o(\sqrt{kT})$ must
have communication $(1 - o(1))T/k$. □
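
To make the role of Khinchine's inequality concrete, consider the extreme case in which the algorithm never communicates, i.e. $I = \emptyset$ (this worked case is an illustration and not part of the original argument). Every block then contributes $k\sigma_{i,1}$ to the sum of payoff differences, and Khinchine's inequality in the form $E|\sum_i a_i \sigma_i| \ge \|a\|_2 / \sqrt{2}$, applied with $a_i = k$ for each of the $T/k$ blocks, gives:

```latex
\begin{align*}
E\left|\sum_{t=1}^{T}\bigl(p^t[1]-p^t[2]\bigr)\right|
  &= E\left|\sum_{i=1}^{T/k} k\,\sigma_{i,1}\right|
  \;\ge\; \frac{1}{\sqrt{2}}\sqrt{\frac{T}{k}\cdot k^{2}}
  \;=\; \sqrt{\frac{kT}{2}},\\[4pt]
E\Bigl[\max_{a\in\{1,2\}}\sum_{t=1}^{T}p^t[a]\Bigr]
  &= \frac{T}{2}+\frac{1}{2}\,E\left|\sum_{t=1}^{T}\bigl(p^t[1]-p^t[2]\bigr)\right|
  \;\ge\; \frac{T}{2}+\frac{1}{2}\sqrt{\frac{kT}{2}}.
\end{align*}
```

Since the expected payoff of the algorithm is T/2, the expected regret in this case is at least $\sqrt{kT/8} = \Omega(\sqrt{kT})$.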



B Follow the Perturbed Leader 

Proof of Lemma 1. We first note that, using the given notation, the regret guarantee of FPL($\eta$) (see
Fig. 1(b)) is

$$E[R] \le \frac{B}{\eta} \sum_{t=1}^{T} |p^t|_1 + \eta$$

The above appears in the analysis of Kalai and Vempala [6]. Note that although $|p^t|_1 = p^t[1] + p^t[2]$
($p^t[a] \ge 0$ in our setting), we can use the following trick. We first observe that since FPL($\eta$) only
depends on the difference between the cumulative payoffs of the two experts, we may replace the
payoff vectors $p^t$ by $\tilde{p}^t$, where

(i) if $p^t[1] \ge p^t[2]$, then $\tilde{p}^t[1] = p^t[1] - p^t[2]$ and $\tilde{p}^t[2] = 0$;
(ii) if $p^t[1] < p^t[2]$, then $\tilde{p}^t[1] = 0$ and $\tilde{p}^t[2] = p^t[2] - p^t[1]$.

Next, we observe that the regret of FPL($\eta$) with payoff sequence $p^t$ and with payoff sequence $\tilde{p}^t$
is identically distributed, since the random choices only depend on the difference between the
cumulative payoffs at any time. Lastly, we note that $|\tilde{p}^t|_1 = |p^t[1] - p^t[2]|$, which completes the proof. □
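
The replacement used in the proof above can be checked mechanically. The sketch below, with hypothetical helper names, maps a two-expert payoff vector to its reduced form and lets one verify that the gap between the experts is unchanged while the 1-norm becomes the absolute gap.

```python
def reduce_payoff(p):
    """Replace (p[1], p[2]) by a vector with the same gap and one zero entry."""
    a, b = p
    return (a - b, 0.0) if a >= b else (0.0, b - a)

def l1_norm(p):
    """1-norm of a two-expert payoff vector."""
    return abs(p[0]) + abs(p[1])

if __name__ == "__main__":
    for p in [(0.7, 0.2), (0.1, 0.9), (0.5, 0.5)]:
        q = reduce_payoff(p)
        print(p, "->", q, "gap", q[0] - q[1], "l1", l1_norm(q))
```

Because FPL only looks at the cumulative gap, running it on the reduced vectors produces the same distribution over decisions as running it on the original ones.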



C Site Prediction : Missing Proofs 
C.1 Generalizing DFPL to n experts

In this section, we generalize our DFPL algorithm for two experts to handle n experts. Lemma 2
showed that algorithm DFPL, in the setting of two experts, guarantees that the expected regret is
at most $c_0 \sqrt{\ell^{5/6} T}$, where $c_0$ is a universal constant.

Our generalization follows a recursive approach. Suppose that some algorithm $\mathcal{A}$ can achieve
expected regret $c_0 \log(n) \sqrt{\ell^{5/6} T}$ with n experts. We show that we can construct an algorithm $\mathcal{A}'$ that
achieves expected regret $c_0 (\log(n) + 1) \sqrt{\ell^{5/6} T}$ with 2n experts as follows. We run 2 independent copies of
$\mathcal{A}$ (say $\mathcal{A}_1$ and $\mathcal{A}_2$) such that $\mathcal{A}_1$ only deals with the first n experts $a_1, a_2, \ldots, a_n$ and $\mathcal{A}_2$ with the
rest of the experts $a_{n+1}, \ldots, a_{2n}$. Then our algorithm $\mathcal{A}'$ treats $\mathcal{A}_1$ and $\mathcal{A}_2$ as 2 experts and runs the
DFPL algorithm (Section 3.1) over these two experts. The analysis for regret is straightforward.
Let the regret for $\mathcal{A}_1$ be $R_1$ and the regret for $\mathcal{A}_2$ be $R_2$. We have

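$$E[\mathrm{Payoff}(\mathcal{A}_1)] \ge \max_{i \in \{1,\ldots,n\}} \sum_{t=1}^{T} p^t[i] - E[R_1] \quad \text{and} \quad E[\mathrm{Payoff}(\mathcal{A}_2)] \ge \max_{i \in \{n+1,\ldots,2n\}} \sum_{t=1}^{T} p^t[i] - E[R_2].$$

We know that $E[R_1] \le c_0 \log(n) \sqrt{\ell^{5/6} T}$ and $E[R_2] \le c_0 \log(n) \sqrt{\ell^{5/6} T}$.
Next, we can see that

$$E[\mathrm{Payoff}(\mathcal{A}') \mid \mathrm{Payoff}(\mathcal{A}_1), \mathrm{Payoff}(\mathcal{A}_2)] \ge \max\{\mathrm{Payoff}(\mathcal{A}_1), \mathrm{Payoff}(\mathcal{A}_2)\} - c_0 \sqrt{\ell^{5/6} T}.$$

We can use the above expression to conclude (taking expectations) that

$$E[\mathrm{Payoff}(\mathcal{A}')] \ge E[\mathrm{Payoff}(\mathcal{A}_1)] - c_0 \sqrt{\ell^{5/6} T},$$
$$E[\mathrm{Payoff}(\mathcal{A}')] \ge E[\mathrm{Payoff}(\mathcal{A}_2)] - c_0 \sqrt{\ell^{5/6} T}.$$

Combining these two inequalities with the bounds on $E[R_1]$ and $E[R_2]$, we conclude that

$$E[\mathrm{Payoff}(\mathcal{A}')] \ge \max_{i \in \{1,\ldots,2n\}} \sum_{t=1}^{T} p^t[i] - c_0 (\log(n) + 1) \sqrt{\ell^{5/6} T}.$$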

This immediately implies that for n experts (starting from the base case of n = 2, where DFPL
works), this recursive approach results in an algorithm that achieves regret $O(\log(n) \sqrt{\ell^{5/6} T})$.
In order to analyze the communication, we observe that, in order to implement the algorithm
correctly, when an algorithm (which is DFPL at some depth in the recursion) decides to communicate
at each time step on a block, the communication on that block is $\ell$. There are at most n copies
of DFPL running (the depth of the recursion is $\log(n) - 1$). However, the corresponding term in the
communication bound, $O(nqT\ell)$, is lower than the term arising from blocks where communication
occurs only at the beginning and end of the block, $O((1 - qn)Tk/\ell)$. Thus, the expected communication
(in terms of number of messages) is asymptotically the same as in the case of 2 experts. If we count
communication complexity as the cost of sending 1 real number, instead of one message, then the
total communication cost is $O(nTk/\ell)$.
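
The recursion can be unrolled numerically. In the sketch below, `base` stands for the two-expert bound $c_0 \sqrt{\ell^{5/6} T}$; the function names are hypothetical, logarithms are taken base 2, and n is assumed to be a power of two so that the experts split evenly at every level.

```python
def _require_power_of_two(n):
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two, at least 2")

def recursive_regret_bound(n, base):
    """Unroll R(2) = base and R(2m) = R(m) + base, giving base * log2(n)."""
    _require_power_of_two(n)
    return base if n == 2 else recursive_regret_bound(n // 2, base) + base

def dfpl_copies(n):
    """Number of two-expert DFPL instances in the binary tournament over n experts."""
    _require_power_of_two(n)
    return 1 if n == 2 else 2 * dfpl_copies(n // 2) + 1

if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, recursive_regret_bound(n, 1.0), dfpl_copies(n))
```

The tournament uses n - 1 instances, which is consistent with the bound of at most n copies of DFPL used in the communication analysis.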



C.2 Lower Bounds 

C.2.1 No Communication Protocol 

In the site-prediction setting, we show that any algorithm that uses no communication must incur
regret $\Omega(\sqrt{kT})$ on some sequence. The proof is quite simple, but does not follow directly from the
$\Omega(\sqrt{T})$ lower bound of the non-distributed case, because although the k sites each run a copy of
some FPL-like algorithm, the best expert might be different across the sites. We only consider the
case when n = 2, since we are more interested in the dependence on T and k.

Lemma 3. If a no-communication protocol is used in the site-prediction model, the expected regret
achieved by any algorithm is at least $\Omega(\sqrt{kT})$.

Proof. The oblivious adversary does the following. Divide the T time steps into T/k blocks of size k. For
each block, toss a random coin and set the payoff vector to be $p_H = (1, 0)$ for heads or $p_T = (0, 1)$
for tails. Each query in a block is assigned to one site (say, in a cyclic fashion). Note that the
expected reward of any algorithm that does not use any communication is T/2, because no site at
any time can perform better than random guessing. But the standard analysis shows that for the
sequence constructed above, $E[\max_{a \in \{1,2\}} \sum_{t=1}^{T} p^t[a]] \ge T/2 + \Omega(k\sqrt{T/k}) = T/2 + \Omega(\sqrt{kT})$. □
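
For small instances the quantity in this proof can be computed exactly instead of bounded. The sketch below is an illustration with hypothetical function names; it assumes that k divides T and that every site can do no better than random guessing, so that the algorithm's expected payoff is T/2 and the regret equals half of k times the expected absolute value of a sum of T/k Rademacher signs.

```python
import math

def expected_abs_rademacher_sum(m):
    """Exact E|sigma_1 + ... + sigma_m| for independent Rademacher signs."""
    total = sum(math.comb(m, h) * abs(2 * h - m) for h in range(m + 1))
    return total / 2 ** m

def no_communication_regret(T, k):
    """Expected regret against the block adversary when no site communicates."""
    blocks = T // k                      # assumption: k divides T
    return 0.5 * k * expected_abs_rademacher_sum(blocks)

if __name__ == "__main__":
    T, k = 400, 4
    print(no_communication_regret(T, k), "vs sqrt(kT/8) =", math.sqrt(k * T / 8))
```

By Khinchine's inequality the exact value is never below $\sqrt{kT/8}$, which matches the $\Omega(\sqrt{kT})$ rate in Lemma 3.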

C.2.2 Lower Bound using Distributed Counter 

This section contains the proof of Theorem 4.

Proof of Theorem 4.

Part 1: The oblivious adversary decides to use only $\beta$ out of the k sites. The adversary divides
the input sequence into $T/\beta$ blocks, each block of size $\beta$. For each block, the adversary tosses an
unbiased coin and sets the payoff vector to $p_H = (1, 0)$ or $p_T = (0, 1)$ according to whether the coin
toss resulted in heads or tails. Let $\tilde{P}^t[a] = P^{t^*}[a]$, where $t^*$ is the largest value such that $t^* \le t$ and $t^* = \beta i$
for some integer i (i.e. $t^*$ is the time at the end of the last completed block). Note that $|\tilde{P}^t[a] - P^t[a]| \le \beta$, so
$\tilde{P}^t[a]$ is a valid (approximate) value of the cumulative payoff of action a. However, since the payoff
vectors across the blocks are completely uncorrelated and each site makes a decision only once in
each block, the expected reward at any time step t is 1/2, and the overall expected reward is T/2.

Note that it is easy to show that $E[\max_{a \in \{1,2\}} \sum_{t=1}^{T} p^t[a]] \ge T/2 + \Omega(\sqrt{\beta T})$ using standard
techniques. Thus the expected regret is at least $\Omega(\sqrt{\beta T})$.
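
The approximate counter used in Part 1 can be sketched as follows: the reported value is frozen at the last multiple of $\beta$, so it lags the exact cumulative payoff by fewer than $\beta$ time steps. The function name is hypothetical, and the sketch tracks a single expert with per-step payoffs in [0, 1].

```python
def stale_estimates(payoffs, beta):
    """Return a list of (exact, frozen) cumulative payoffs, one pair per step.

    frozen is the exact cumulative payoff at the last time step that is a
    multiple of beta (0 before the first such step).
    """
    exact, frozen, out = 0, 0, []
    for t, x in enumerate(payoffs, start=1):
        exact += x
        if t % beta == 0:
            frozen = exact
        out.append((exact, frozen))
    return out

if __name__ == "__main__":
    pairs = stale_estimates([1] * 10, beta=3)
    print(max(e - f for e, f in pairs))   # largest lag of the frozen value
```

Such an estimate is a valid $\beta$-approximation and yet carries no information about the current block, which is what the adversary in Part 1 exploits.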

Part 2: Let $\beta = k/10$. Now consider the input sequence that is all 1's, divided
into T/k blocks of size k. For each block, the oblivious adversary chooses a random permutation
of $\{1, \ldots, k\}$ and allocates the 1's to the sites in that order. Note that when a site receives a 1, it is
required to have a $\beta$-approximate value of the current count. Suppose there was no communication
since this site last received a query; then at that time the estimate at this site was at most $ik + \beta$.
Now, depending on where in the permutation the site is, it may be required to have a value in any
of the intervals $[ik - \beta, ik + \beta]$, $[ik, ik + 2\beta]$, $[ik + \beta, ik + 3\beta]$, $\ldots$, $[(i+1)k - \beta, (i+1)k + 2\beta]$. There
are at least 5 disjoint intervals in this set, and each of them is equally probable. Thus with
probability at least 4/5, in the absence of any communication, this site fails to have the correct
approximate estimate.

If, on the other hand, every site communicates at least once every time it receives a query, then
the total communication is at least T. □

D Proof of Theorem 5

Proof of Theorem 5. To prove Theorem 5, we construct a set of reward sequences $p_{(0)}, p_{(1)}, \ldots$ and
show that any FPL-like algorithm (as described in Section 4) will have regret $\Omega(\sqrt{T})$ on at least one
of these sequences unless the communication is essentially linear in T.

Before we start the actual analysis, we need to introduce some more notation. First, recall
that C is an upper bound on the amount of communication allowed in the protocol. We shall
focus on reward sequences where at any time step exactly one of the experts receives payoff 1 and the
other expert receives payoff 0, i.e. $p^t \in \{(0, 1), (1, 0)\}$ for any t. Let $g^p(t) = p^t[1] - p^t[2]$, and let
$G^p(t) = \sum_{t'=1}^{t} g^p(t')$. Thus, we note that the payoff vectors p, the function $g^p$, and the function
$G^p$ all encode equivalent information regarding payoffs as a function of time.

Suppose $\mathcal{A}$ is an algorithm that achieves optimal regret under the communication bound C.
Let r denote the random coin tosses used by $\mathcal{A}$. Thus we may think of r as being a string of length
poly(n, k)T fixed ahead of time. Let $p^1, \ldots, p^T$ be a specific input sequence. Let $T_1, T_2, \ldots$
denote the time steps when communication occurs. We note that $T_j$ may depend on $r_j$, which is
a prefix of the (random) string r that the algorithm observes until time step $T_j$, and may also
depend on the payoff vectors $p^1, \ldots, p^{T_j}$.

Next, we describe the set of reward sequences used to "fool" the algorithm. Let $\lambda$ be a parameter
that will be fixed later. We construct up to $(T/(2\lambda)) + 1$ possible payoff sequences. We denote these
payoff sequences by $p_{(0)}, p_{(1)}, \ldots$. These sequences are constructed as follows:

• $p_{(0)}$: Let $g^+$ denote a sequence of $\lambda$ consecutive 1's and $g^-$ denote a sequence of $\lambda$ consecutive
−1's. Then the sequence $(g^{p_{(0)}}(t))_{t \le T}$ is defined to be the sequence $g^-, g^+, g^+, g^-, g^-, \ldots$, i.e.
$g^{p_{(0)}}(t) = -1$ if $\lceil \lfloor (t-1)/\lambda \rfloor / 2 \rceil$ is even and $g^{p_{(0)}}(t) = 1$ if $\lceil \lfloor (t-1)/\lambda \rfloor / 2 \rceil$ is odd. Furthermore,
we assume that $T = (4m_1 + 3)\lambda$ for some integer $m_1$. This means that $G^{p_{(0)}}(T) = \lambda$, i.e.
eventually expert 1 will be the better expert.

• $p_{(i)}$ for $i > 0$ and i even: In this payoff sequence, the payoff vectors for the first $(2i-1)\lambda$
time steps will be identical to those in $p_{(0)}$. For the rest of the time steps the payoff vector will
always be (1, 0), i.e. the first expert always receives a unit payoff for $t > (2i-1)\lambda$. Thus,
for sequences of this form, where i is even, expert 1 will be the better expert.

• $p_{(i)}$ for $i > 0$ and i odd: In this payoff sequence, the payoff vectors for the first $(2i-1)\lambda$
time steps will be identical to those in $p_{(0)}$. For the rest of the time steps, the payoff vector will always
be (0, 1), i.e. the second expert always receives a unit payoff for $t > (2i-1)\lambda$. Thus, for
sequences of this form, where i is odd, expert 2 will be the better expert.
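
The family of fooling sequences can be written down directly in terms of the difference sequence g, where g(t) = +1 stands for the payoff vector (1, 0) and g(t) = −1 for (0, 1). The following is a minimal sketch with hypothetical function names; the block-sign rule is read off from the block pattern −, +, +, −, −, … described for the base sequence.

```python
def base_gap_sequence(T, lam):
    """g(t) for the base sequence: blocks of length lam with signs -, +, +, -, -, ..."""
    g = []
    for t in range(1, T + 1):
        block = (t - 1) // lam                      # 0-based block index
        g.append(-1 if ((block + 1) // 2) % 2 == 0 else 1)
    return g

def fooling_gap_sequence(T, lam, i):
    """g(t) for the i-th sequence: follow the base sequence for (2i-1)*lam
    steps, then keep rewarding expert 1 (i even) or expert 2 (i odd)."""
    g = base_gap_sequence(T, lam)
    if i == 0:
        return g
    cut = (2 * i - 1) * lam
    tail = 1 if i % 2 == 0 else -1
    return g[:cut] + [tail] * (T - cut)

if __name__ == "__main__":
    lam, T = 2, (4 * 1 + 3) * 2                     # T = (4*m1 + 3) * lam with m1 = 1
    print(sum(base_gap_sequence(T, lam)))           # final gap of the base sequence
    print(sum(fooling_gap_sequence(T, lam, 1)), sum(fooling_gap_sequence(T, lam, 2)))
```

Each branching point $(2i-1)\lambda$ is a turning point of the zig-zag, so an algorithm that has not communicated recently cannot tell whether the input is about to turn or to keep going in the same direction.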

Furthermore, in what follows, we assume that there is only one site node. (This is not a problem,
since the worst-case adversary could send all the payoff vectors to just one of the site nodes.) We shall refer
to the i-th cycle of the input in the above sequences as the input between time steps $(4i+2)\lambda -
(\sqrt{T}/2) + 1$ and $(4i+4)\lambda + (\sqrt{T}/2)$. Let $F^i$ be an indicator random variable (depending on the
randomness r of the algorithm) such that $F^i = 0$ if there is some communication between the
time steps $2i\lambda + \sqrt{T}/2$ and $(2i+2)\lambda - \sqrt{T}/2$. If there is no communication, we set $F^i = 1$.

Now, we prove the main result using a series of claims. First, we add a few extra
communication points, and argue that this only increases the payoff of the algorithm (hence decreases
regret). Let $\mathcal{I} = \{i \mid F^{2i} = F^{2i+1} = F^{2i+2} = 0\}$. Note that $\mathcal{I}$ itself is a random variable. For
every $i \in \mathcal{I}$, we allow extra communication to the algorithm (for free) at the end of the following
time steps: $(4i+2)\lambda - \sqrt{T}/2$, $(4i+2)\lambda + \sqrt{T}/2$, $(4i+4)\lambda - \sqrt{T}/2$, and $(4i+4)\lambda + \sqrt{T}/2$. Note that
this extra communication can only increase the payoff, precisely because $F^{2i} = F^{2i+1} = F^{2i+2} = 0$.
This extra communication is given for free, so it is favorable to the trade-off of the algorithm.
Despite this, we will show that the regret of this algorithm has to be large. This is done through a
series of claims, each of which is proved subsequently as a lemma.

Claim A Let $R^{p_{(i)}}_{\mathcal{A}}(1, T)$ denote the (random variable) regret of playing according to
algorithm $\mathcal{A}$ against payoff sequence $p_{(i)}$, using randomness r, between time steps 1 and T.
Then, if $E[R^{p_{(i)}}_{\mathcal{A}}(1, T)] = O(\sqrt{T})$ for all $1 \le i \le T/(2\lambda)$, then $E[|\mathcal{I}|] \ge T/(4\lambda)$. This fact is proved
in Lemma 4.

Claim B Suppose $i \in \mathcal{I}$, and let $C(i)$ be the communication during the i-th cycle. Then we
can state the following regarding the payoff on the rounds with respect to sequence $p_{(0)}$ within
the i-th cycle. Here $c_0$ is some absolute constant.

$$\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}\big((4i+2)\lambda - \sqrt{T}/2 + 1,\ (4i+4)\lambda + \sqrt{T}/2\big) \le \lambda + \sqrt{T}/2 - \frac{c_0 \sqrt{T}}{C(i)}$$

This fact is proved in Lemma 5.

Claim C Let t be a point such that communication happened just after time step t. Let $\tau > t$
be a point such that $G(t) = G(\tau)$. Then $\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}(t + 1, \tau) \le (\tau - t)/2$. This fact is proved
in Lemma 6.

Now, let us calculate the regret of the algorithm. If the expected regret of the algorithm with
respect to each sequence $p_{(i)}$, for $i > 0$, is at most $O(\sqrt{T})$, then it must be the case that $E[|\mathcal{I}|] \ge T/(4\lambda)$
(using Claim A above). Now, we assumed that in the sequence $p_{(0)}$, expert 1 eventually wins. Let
$\mathcal{I} = \{i_1, \ldots, i_k\}$, where $i_1 < i_2 < \cdots < i_k$ and $E[k] \ge T/(4\lambda)$. Then, we add up the payoff of the
algorithm as follows. First, (using Claim B above) notice that:

$$E\big[\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}\big((4i_j+2)\lambda - \sqrt{T}/2 + 1,\ (4i_j+4)\lambda + \sqrt{T}/2\big)\big] \le \lambda + \sqrt{T}/2 - \frac{c_0 \sqrt{T}}{C(i_j)} \qquad (7)$$



Then let $B_j$ denote the interval $((4i_j+4)\lambda + \sqrt{T}/2 + 1,\ (4i_{j+1}+2)\lambda - \sqrt{T}/2)$, i.e. the interval between the $i_j$-th
and the $i_{j+1}$-th cycle. Also, let $B_0 = (\sqrt{T}/2 + 1,\ (4i_1+2)\lambda - \sqrt{T}/2)$ denote the interval before the first
cycle in $\mathcal{I}$, and let $B_k = ((4i_k+4)\lambda + \sqrt{T}/2 + 1,\ T - \lambda - \sqrt{T}/2)$ denote the interval after the last
cycle. Now, using Claim C above, we get that the payoff received by the algorithm in any interval
$B_j$ is at most half the length of the interval. Thus, the only time steps that we have not accounted for are
$(1, \sqrt{T}/2)$ and $(T - \lambda - \sqrt{T}/2 + 1, T)$. The total number of time steps in these two intervals is $\lambda$.
Let us give the algorithm payoff $\lambda$ for free on these time steps. Then, adding up everything, and
noting that the payoff of the algorithm, $\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}$, is a random variable defined over the space measurable by
$\{F^i\}_{i \ge 0}$ and C, we get



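$$\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}(1, T) \le \frac{T}{2} + \frac{\lambda}{2} - \sum_{j=1}^{k} \frac{c_0 \sqrt{T}}{C(i_j)}.$$

Thus, we get

$$E\big[R^{p_{(0)}}_{\mathcal{A}}(1, T) \mid \{F^i\}_{i \ge 0}, C\big] \ge E\Big[\sum_{j=1}^{|\mathcal{I}|} \frac{c_0 \sqrt{T}}{C(i_j)} - \frac{\lambda}{2} \,\Big|\, \{F^i\}_{i \ge 0}, C\Big] \ge c_0 \frac{|\mathcal{I}|^2 \sqrt{T}}{C} - \frac{\lambda}{2} \qquad (\mathcal{I} \text{ is measurable by } \{F^i\}_{i \ge 0})$$

We use Jensen's inequality and the fact that $C \ge \sum_{i \in \mathcal{I}} C(i)$ to get the last inequality. Finally,
using Claim A and by setting $\lambda$ appropriately, we get

$$E\big[R^{p_{(0)}}_{\mathcal{A}}(1, T)\big] \ge \frac{c_0 T^{1.5 - 2\epsilon_1}}{16\, C} - \frac{\lambda}{2}.$$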


□ 
We now prove the lemmas mentioned in the above proof.

Lemma 4. If $E[R^{p_{(i)}}_{\mathcal{A}}(1, T)] = O(\sqrt{T})$ for all $1 \le i \le T/(2\lambda)$, then $E[|\mathcal{I}|] \ge T/(4\lambda)$.

Proof. Our crucial observation here is that when the random tosses of the algorithm are fixed,
the algorithm will have identical behavior against the reward sequences $p_{(0)}$ and $p_{(m)}$ for any
$1 \le m \le T/(2\lambda)$ up to time $2m\lambda - \lambda$. Thus, if we couple the process of executing $\mathcal{A}$ against $p_{(0)}$ with
the one of executing $\mathcal{A}$ against $p_{(m)}$ with the same random tosses in the algorithm, we are able
to relate the random variables $\{F^i\}_{i \ge 0}$ to the regrets for other reward sequences. Specifically, it
is not difficult to see that



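$$E\big[R^{p_{(m)}}_{\mathcal{A}}(1, 2m\lambda + 1) \mid \{F^i\}_{i \ge 0}\big] \ge c_0 \max_{i \text{ odd}} \Big\{ (1 - F^i)\, F^{m-1} \Big( \prod_{j=i+1}^{m-2} F^j \Big) \Big\} \cdot \lambda \qquad (8)$$

when m is odd, and

$$E\big[R^{p_{(m)}}_{\mathcal{A}}(1, 2m\lambda + 1) \mid \{F^i\}_{i \ge 0}\big] \ge c_0 \max_{i \text{ even}} \Big\{ (1 - F^i)\, F^{m-1} \Big( \prod_{j=i+1}^{m-2} F^j \Big) \Big\} \cdot \lambda \qquad (9)$$

when m is even.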

We may then use this observation to prove the lemma. Let m be an arbitrary number. We shall
show that $\Pr[m \in \mathcal{I}] \ge 1/2$.

Let $\mathcal{E}(s)$ be the event that the suffix of $\{F^i\}_{1 \le i \le m}$ is s. For example,
$\mathcal{E}(000)$ represents the event that $F^{m-2} = F^{m-1} = F^m = 0$. We first partition the probability space into
the following events:

$\mathcal{E}(000)$, $\mathcal{E}(001)$, $\mathcal{E}(010)$, $\mathcal{E}(011)$, $\mathcal{E}(0100)$, $\mathcal{E}(01100)$, $\mathcal{E}(11100)$, $\mathcal{E}(101)$, $\mathcal{E}(0110)$, $\mathcal{E}(1110)$, and $\mathcal{E}(111)$.

Furthermore, we let $\mathcal{E}_0(11100)$ be the subset of $\mathcal{E}(11100)$ such that the last zero in the sequence
$F^0, \ldots, F^{m-5}$ has an even index, and let $\mathcal{E}_1(11100) = \mathcal{E}(11100) - \mathcal{E}_0(11100)$. Similarly, we let

• $\mathcal{E}_0(1110)$ be the subset of $\mathcal{E}(1110)$ such that the last zero in the sequence $F^0, \ldots, F^{m-4}$ has an
even index; let $\mathcal{E}_1(1110) = \mathcal{E}(1110) - \mathcal{E}_0(1110)$.

• $\mathcal{E}_0(111)$ be the subset of $\mathcal{E}(111)$ such that the last zero in the sequence $F^0, \ldots, F^{m-3}$ has an
even index; let $\mathcal{E}_1(111) = \mathcal{E}(111) - \mathcal{E}_0(111)$.

Now the whole probability space can be partitioned into the following events: $\mathcal{E}(000)$, $\mathcal{E}(001)$,
$\mathcal{E}(010)$, $\mathcal{E}(011)$, $\mathcal{E}(0100)$, $\mathcal{E}(01100)$, $\mathcal{E}_0(11100)$, $\mathcal{E}_1(11100)$, $\mathcal{E}(101)$, $\mathcal{E}(0110)$, $\mathcal{E}_0(1110)$, $\mathcal{E}_1(1110)$, $\mathcal{E}_0(111)$, $\mathcal{E}_1(111)$.

Let $\epsilon_2$ be an arbitrary constant such that $0 < \epsilon_2 < \epsilon_1$. It is not difficult to see that if any of
the events above, except for $\mathcal{E}(000)$, happens with probability at least $T^{-\epsilon_2}$, then one of the $p_{(i)}$ will
have $\omega(\sqrt{T})$ regret. We will just examine one event to illustrate the idea. The rest of them can be
verified in a similar way. Suppose $\Pr[\mathcal{E}(001)] \ge T^{-\epsilon_2}$; then we have

$$E\big[R^{p_{(m-1)}}_{\mathcal{A}}(1, T)\big] \ge E\big[R^{p_{(m-1)}}_{\mathcal{A}}(1, T) \mid \mathcal{E}(001)\big] \Pr[\mathcal{E}(001)] = \omega(\sqrt{T}) \qquad \text{(by (8) and (9))}.$$

Thus, we can conclude that $\Pr[\mathcal{E}(000)] \ge 1 - 13\,T^{-\epsilon_2} \ge 1/2$ for sufficiently large T, which concludes
our proof. □
Lemma 5. Let $i \in \mathcal{I}$, and let $C(i)$ denote the communication in the i-th cycle. Then,

$$E\big[\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}\big((4i+2)\lambda - \sqrt{T}/2 + 1,\ (4i+4)\lambda + \sqrt{T}/2\big)\big] \le \lambda + \sqrt{T}/2 - \frac{c_0 \sqrt{T}}{C(i)}.$$

Proof. Using Lemma 6, it is easy to see that $E[\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}((4i+2)\lambda + \sqrt{T}/2 + 1,\ (4i+4)\lambda -
\sqrt{T}/2)] \le \lambda - \sqrt{T}/2$. Now, let us consider the interval $((4i+2)\lambda - \sqrt{T}/2 + 1,\ (4i+2)\lambda + \sqrt{T}/2)$.
Let $T_0 = (4i+2)\lambda - \sqrt{T}/2, T_1, \ldots, T_c = (4i+2)\lambda + \sqrt{T}/2$ be the time steps when communication
occurs. Note that the communication at time steps $T_0$ and $T_c$ is for free, and that $c \le C(i)$. Let
$w(x)$ denote the probability of picking the first expert according to follow the perturbed leader
(FPL), if x is the difference between the cumulative payoffs of the first and second expert so far.
Thus, if $x = -\sqrt{T}$, $w(x) = 0$, and if $x = \sqrt{T}$, $w(x) = 1$. We have,



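$$w(x) = \begin{cases} 1 & x \ge \sqrt{T} \\ 1 - \frac{1}{2}\left(1 - \frac{x}{\sqrt{T}}\right)^2 & 0 \le x < \sqrt{T} \\ \frac{1}{2}\left(1 + \frac{x}{\sqrt{T}}\right)^2 & -\sqrt{T} \le x < 0 \\ 0 & x < -\sqrt{T} \end{cases}$$

Then, we have

$$E\big[\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}\big((4i+2)\lambda - \sqrt{T}/2 + 1,\ (4i+2)\lambda + \sqrt{T}/2\big)\big] = \sum_{j=0}^{c-1} w\big(G^{p_{(0)}}(T_j)\big)(T_{j+1} - T_j).$$

We use the following claim (which is an exercise in simple calculus) to complete the proof.

Claim 1. Let $f : [a, b] \to \mathbb{R}^+$ be an increasing function such that $f'(x) \ge L$ on $[a, b]$. Let $x_0 = a <
x_1 < \cdots < x_c = b$; then

$$\sum_{j=0}^{c-1} f(x_j)(x_{j+1} - x_j) \le \int_a^b f(x)\,dx - \frac{L (b - a)^2}{2c}.$$

Now, notice that $G^{p_{(0)}}(T_0) = -\sqrt{T}/2$, $G^{p_{(0)}}(T_c) = \sqrt{T}/2$, and $\int_{G^{p_{(0)}}(T_0)}^{G^{p_{(0)}}(T_c)} w(x)\,dx = \int_{-\sqrt{T}/2}^{\sqrt{T}/2} w(x)\,dx = \sqrt{T}/2$. Also,
$w'(x) \ge 1/(2\sqrt{T})$ on this range. Thus, applying the above claim, we get

$$E\big[\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}\big((4i+2)\lambda - \sqrt{T}/2 + 1,\ (4i+2)\lambda + \sqrt{T}/2\big)\big] = \sum_{j=0}^{c-1} w\big(G^{p_{(0)}}(T_j)\big)(T_{j+1} - T_j) \le \frac{\sqrt{T}}{2} - \frac{\sqrt{T}}{4c}.$$

Similarly, with c now denoting the number of communication intervals in the second window, we can prove that

$$E\big[\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}\big((4i+4)\lambda - \sqrt{T}/2 + 1,\ (4i+4)\lambda + \sqrt{T}/2\big)\big] = \sum_{j=0}^{c-1} w\big(G^{p_{(0)}}(T_j)\big)(T_{j+1} - T_j) \le \frac{\sqrt{T}}{2} - \frac{\sqrt{T}}{4c}.$$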

Adding up across the three intervals, and using $c \le C(i)$, we complete the proof of the lemma. □
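
Claim 1 can be sanity-checked numerically for the FPL selection probability. The sketch below makes several assumptions that are not taken from the paper: it uses the piecewise-quadratic form $w(x) = \frac{1}{2}(1 + x/s)^2$ for $-s \le x < 0$ and $w(x) = 1 - \frac{1}{2}(1 - x/s)^2$ for $0 \le x < s$ with $s = \sqrt{T}$, it places the communication points at equal spacing, and the function names are hypothetical.

```python
def pick_prob(x, s):
    """Probability of picking expert 1 when the cumulative gap is x (s = sqrt(T))."""
    if x >= s:
        return 1.0
    if x >= 0:
        return 1.0 - 0.5 * (1.0 - x / s) ** 2
    if x >= -s:
        return 0.5 * (1.0 + x / s) ** 2
    return 0.0

def crossing_payoff(s, c):
    """Expected payoff while the gap rises from -s/2 to s/2, when the decision
    probability is refreshed only at c equally spaced communication points."""
    a, b = -s / 2, s / 2
    xs = [a + (b - a) * j / c for j in range(c + 1)]
    return sum(pick_prob(xs[j], s) * (xs[j + 1] - xs[j]) for j in range(c))

def claim_bound(s, c):
    """Right-hand side of Claim 1 with integral s/2, L = 1/(2s) and b - a = s."""
    return s / 2 - (1 / (2 * s)) * s ** 2 / (2 * c)

if __name__ == "__main__":
    s = 100.0                                   # corresponds to T = 10000
    for c in (1, 2, 5, 50):
        print(c, crossing_payoff(s, c), "<=", claim_bound(s, c))
```

The gap between the two columns shrinks roughly like 1/c, which is the source of the communication-dependent loss term in Lemma 5.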

Finally, we prove the following: 

Lemma 6. Let $\{T_j\}_{j \ge 1}$ be the points where communication occurs in the algorithm $\mathcal{A}$. Pick some $T_i$
and let $\tau > T_i$ be such that $G^{p_{(0)}}(\tau) = G^{p_{(0)}}(T_i)$. Then, $\mathrm{Payoff}^{p_{(0)}}_{\mathcal{A}}(T_i + 1, \tau) \le (\tau - T_i)/2$.
    original sequence: $p_0$
    new sequence: $p'$
    set $p'^t = p^t$ for all $t \le T_i$
    set $t = T_i + 1$
    for $j = i, \ldots, \ell - 1$
        for $v = G^{p_0}(T_j + 1), \ldots, G^{p_0}(T_{j+1})$
            set $G^{p'}(t) = v$, $t = t + 1$
        set $t(j+1) = t$
    for $v = G^{p_0}(T_\ell + 1), \ldots, G^{p_0}(\tau)$
        set $G^{p'}(t) = v$, $t = t + 1$
    set $\tau' = t$

Figure 3: Algorithm to construct a sequence in Lemma 6
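
The idea behind the procedure in Figure 3 can be sketched in Python as follows. This is a simplified variant under stated assumptions rather than a transcription of the figure: consecutive communication points are connected by a monotone walk on the cumulative gap G, starting from the gap value at the earlier point, and the function name and the returned time map are hypothetical.

```python
def shortest_path_gaps(G, comm_times, tau):
    """Build the cumulative gap of a shortened sequence.

    G[t] is the cumulative gap of the original sequence at time t (G[0] == 0),
    comm_times is the sorted list of communication times, all at most tau.
    Returns (new_G, t_map), where t_map sends each original communication time
    (and tau) to its position in the new sequence.
    """
    start = comm_times[0]
    new_G = list(G[: start + 1])           # keep the prefix up to the first point
    t_map = {start: start}
    waypoints = list(comm_times) + [tau]
    for a, b in zip(waypoints, waypoints[1:]):
        step = 1 if G[b] > G[a] else -1
        # walk monotonically from G[a] to G[b]; empty when the values coincide
        new_G.extend(range(G[a] + step, G[b] + step, step))
        t_map[b] = len(new_G) - 1
    return new_G, t_map

if __name__ == "__main__":
    G = [0, 1, 2, 3, 2, 1, 0, -1, 0, 1]    # a small zig-zag
    print(shortest_path_gaps(G, [1, 4], 8))
```

Every detour that the original sequence takes between two communication points is removed, so the new sequence visits the same gap values at the communication points in fewer time steps.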

Proof. We will instead show that $E[R^{p_{(0)}}_{\mathcal{A}}(T_i + 1, \tau)] \ge 0$ and observe that both experts have equal
payoffs in the time steps $(T_i + 1, \tau)$, since $G^{p_{(0)}}(T_i) = G^{p_{(0)}}(\tau)$.
We shall construct a new reward sequence $p'$ such that

• $p'^t = p^t$ for all $t \le T_i$.

• There exists a $\tau' > T_i$ such that

$G^{p'}(\tau') = G^{p_0}(\tau) = G^{p_0}(T_i)$ and $E R^{p'}_{\mathrm{Full}}(T_i + 1, \tau') \le E R^{p_0}_{\mathcal{A}}(T_i + 1, \tau)$.

In other words, we first construct a new sequence. Then we argue that the local regret of using
Full over the new sequence is no larger than the original regret. Here, Full is an implementation
of FPL that communicates at every time step (essentially a non-distributed version). Finally, it is
not difficult to see that $E R^{p'}_{\mathrm{Full}}(T_i + 1, \tau') \ge 0$ because $G^{p'}(T_i + 1) = G^{p'}(\tau')$, which would complete
the proof of the lemma.

Let $T_\ell$ be the largest communication time step that is no larger than $\tau$. We use the algorithmic
procedure described in Figure 3 to construct the new sequence. Notice that our construction gives
the function $G^{p'}$, which indirectly gives $p'$.

Roughly speaking, our new $p'$ uses the "shortest path" to connect $G(T_j)$ and $G(T_{j+1})$
for all $T_j$ between $T_i$ and $T_\ell$. Then $p'$ is concatenated with another "shortest path" from $T_\ell$ to $\tau$.
For the purpose of our analysis, we also let $t(j)$ be the new time step in $p'$ that corresponds to
the old $T_j$ in $p_0$. We shall prove the following two statements.

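• For any $i \le j \le \ell - 1$,

$$E\big[R^{p_0}_{\mathcal{A}}(T_j + 1, T_{j+1}) \mid \{T_j\}_{j \ge 1}\big] \ge E R^{p'}_{\mathrm{Full}}(t(j) + 1, t(j+1)). \qquad (10)$$

• Also,

$$E\big[R^{p_0}_{\mathcal{A}}(T_\ell + 1, \tau) \mid \{T_j\}_{j \ge 1}\big] \ge E R^{p'}_{\mathrm{Full}}(t(\ell) + 1, \tau'). \qquad (11)$$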
One can see that these two statements are sufficient to prove our claim:

$$E\big[R^{p_0}_{\mathcal{A}}(T_i + 1, \tau) \mid \{T_j\}_{j \ge 1}\big] \ge \sum_{j=i}^{\ell-1} E\big[R^{p'}_{\mathrm{Full}}(t(j) + 1, t(j+1))\big] + E\big[R^{p'}_{\mathrm{Full}}(t(\ell) + 1, \tau')\big] \ge 0.$$



We now move on to prove (10) and (11). Specifically, we only demonstrate the proof of (10); the
proof of (11) is similar.

Without loss of generality, we may assume that $T_{j+1} - T_j \le 4\lambda$ for any $i \le j \le \ell - 1$, since if
within one whole cycle there is no communication, the expected regret for this cycle is 0.

We consider the following three cases.

Case 1. $T_j$ and $T_{j+1}$ are on the same slope of a cycle (i.e. $G(t)$ is monotonic between $T_j$ and $T_{j+1}$).
In this case, $t(j+1) - t(j) = T_{j+1} - T_j$. With a straightforward calculation, we can see that Full
is always better on $p'$.



Case 2. There is only one zig-turn (namely, at time $T_z$) between $T_j$ and $T_{j+1}$. Furthermore, we may
assume $|T_z - T_j| \ge |T_z - T_{j+1}|$. The other case can be proved similarly. Let $T'_{j+1} = T_z - |T_z - T_{j+1}|$.
The crucial observation here is that $G^p(T'_{j+1}) = G^p(T_{j+1})$. Since there is no communication
between time $T'_{j+1} + 1$ and $T_{j+1}$, the expected regret in this region is 0, i.e.

$$E\big[R^{p}_{\mathcal{A}}(T'_{j+1} + 1, T_{j+1}) \mid \{T_j\}_{j \ge 1}\big] = 0.$$

On the other hand, since $T'_{j+1}$ and $T_j$ are on the same slope, running a full communication algorithm
is strictly better between $T_j$ and $T'_{j+1}$. Finally, since the sub-interval $G^{p'}(t(j) + 1), \ldots, G^{p'}(t(j+1))$
is identical to $G^p(T_j + 1), \ldots, G^p(T'_{j+1})$ by construction, we have

$$E\big[R^{p'}_{\mathrm{Full}}(t(j) + 1, t(j+1))\big] \le E\big[R^{p}_{\mathcal{A}}(T_j + 1, T'_{j+1})\big] = E\big[R^{p}_{\mathcal{A}}(T_j + 1, T_{j+1})\big].$$



Case 3. There are two zig-turns (namely $T_z$ and $T_{z'}$) between $T_j$ and $T_{j+1}$. Let $T'_j = 2T_z - T_j$ and
$T'_{j+1} = 2T_{z'} - T_{j+1}$. Without loss of generality, let us assume that $T'_j < T'_{j+1}$. Our observation here
is that the expected regret between $T_j + 1$ and $T'_j$ for $\mathcal{A}$ is 0. Furthermore, the expected regret
between $T'_{j+1} + 1$ and $T_{j+1}$ is also 0. Then we can apply the arguments that appeared in Case 2 again
here to show that running Full for the interval between $T'_j + 1$ and $T'_{j+1}$ is strictly better than running $\mathcal{A}$.

Then we can conclude that $E[R^{p'}_{\mathrm{Full}}(t(j) + 1, t(j+1))] \le E[R^{p}_{\mathcal{A}}(T_j + 1, T_{j+1})]$ for this case as well.

□ 